Linear Regression Fundamentals

Dr. Lucy D’Agostino McGowan

Before You Fit a Model

📖 Understand the content matter
As a statistician, I collaborate frequently with subject matter experts to ensure that I understand the context of the problem at hand.

Understand the objective
It is crucial to understand the objectives. Ideally, these are set a priori; if exploratory analyses are being done, that should be made explicit from beginning to end.

Before You Fit a Model

📏 Understand where the data came from
Was this observational or experimental data? Is any data missing? What are the units? Are there data entry issues?

🧹 Get the data into a tidy, analyzable form
Often we get data in a form that is not easily analyzable. In this class, we will be focusing mostly on statistical methodology once the data is in an analyzable format, but just because it is analyzable doesn’t mean the analysis choice is obvious.

Before You Fit a Model

💃 Determine the appropriate model
In this class we are focusing on Linear Models. Linear models are not always appropriate. You must examine your data to determine whether a linear model is a good choice.

Is a Linear Model Appropriate?

  • Outcome variable, \(y\), is continuous
  • Explanatory variable(s), \(X = \{X_1, ..., X_p\}\) can take any form
  • Observations are independent
  • The residuals are homoscedastic (equal variance)
  • The residuals are normally distributed
  • The relationship between \(X\) and \(y\) is linear
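A quick sketch of how one might eyeball the residual-based assumptions in R, using simulated data (the variable names and data here are illustrative, not from the slides):

```r
# Simulated example data (illustrative): continuous outcome, linear in x
set.seed(1)
x <- runif(50, 0, 10)
y <- 1 + 2 * x + rnorm(50)

fit <- lm(y ~ x)

# plot(fit) produces diagnostic plots:
# which = 1 gives residuals vs. fitted (checks linearity and equal variance),
# which = 2 gives a normal Q-Q plot of residuals (checks normality)
plot(fit, which = 1:2)
```

Independence usually cannot be checked from a plot; it comes from knowing how the data were collected.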

Least Squares in Matrix Form

The Linear Model

Standard form: \[\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}\]

Where:

  • \(\mathbf{y}\) is an \(n \times 1\) vector of responses
  • \(\mathbf{X}\) is an \(n \times p\) design matrix
  • \(\boldsymbol{\beta}\) is a \(p \times 1\) vector of parameters
  • \(\boldsymbol{\varepsilon}\) is an \(n \times 1\) vector of errors

The Design Matrix

Simple linear regression: \[\mathbf{X} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}, \quad \boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}\]

The Design Matrix

Multiple regression: \[\mathbf{X} = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix}\]
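In R, `model.matrix()` builds this design matrix from a formula, adding the leading column of 1s for the intercept automatically; a small sketch with made-up predictors:

```r
# Two illustrative predictors
x1 <- c(1.2, 3.4, 5.6, 7.8)
x2 <- c(0, 1, 0, 1)

# model.matrix() prepends the intercept column of 1s
X <- model.matrix(~ x1 + x2)
X
# First column is the intercept; remaining columns are the predictors
```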

What About Hats?

Hat notation indicates estimates or predicted values

Parameters vs. Estimates:

  • \(\beta_0, \beta_1\) = true (unknown) parameters
  • \(\hat{\beta}_0, \hat{\beta}_1\) = estimated parameters from data

Observed vs. Predicted:

  • \(y_i\) = observed response values
  • \(\hat{y}_i\) = predicted values from model

What is a Residual?

Definition: A residual is the difference between an observed value and its predicted value from the model

Formula:

\[\hat\varepsilon_i = y_i-\hat{y}_i\]

You Try: Calculating Residuals

Problem: Given the regression equation \(\hat{y} = 2.3 + 1.5x\) and the data points below, calculate the residual for each observation:

  x     y      \(\hat{y}\)   \(\hat{\varepsilon}\)
  1     4.2    ?             ?
  3     6.8    ?             ?
  5     11.1   ?             ?

Let’s Try It in R

Problem: Given the same regression equation \(\hat{y} = 2.3 + 1.5x\) and \(y = [4.2, 6.8, 11.1]\), \(x = [1, 3, 5]\):

  1. Put the x values in a design matrix called X and the outcome in a vector called y
  2. Put the coefficients in a vector called beta
  3. Multiply them to get \(\hat{y}\)
  4. Subtract from y to get residuals

Let’s Try It in R Solution

# 1. Create design matrix X
X <- matrix(c(1, 1, 
              1, 3, 
              1, 5),
            byrow = TRUE, ncol = 2)
# Create y
y <- c(4.2, 6.8, 11.1)

# 2. Create beta vector
beta <- c(2.3, 1.5)

# 3. Calculate y_hat
y_hat <- X %*% beta

# 4. Calculate residuals
residuals <- y - y_hat

residuals
     [,1]
[1,]  0.4
[2,]  0.0
[3,]  1.3

The Goal: Minimize Squared Errors

Sum of squared errors:

\[\text{SSE} = \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \sum_{i=1}^n \hat\varepsilon_i^2\]

\[\text{SSE} = (\mathbf{y}- \mathbf{X}\boldsymbol\beta)^T(\mathbf{y}- \mathbf{X}\boldsymbol\beta)\]

You Try

Problem: Verify that these two expressions for SSE are the same:

Individual terms: \[\text{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\]

Matrix form: \[\text{SSE} = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})\]

Task: Expand the matrix form and show it equals the summation form
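A numeric sanity check of the equivalence (not a substitute for the algebraic expansion), using the small data set from the earlier residual exercise:

```r
# Data and coefficients from the earlier "Let's Try It in R" exercise
X <- matrix(c(1, 1,
              1, 3,
              1, 5), byrow = TRUE, ncol = 2)
y <- c(4.2, 6.8, 11.1)
beta <- c(2.3, 1.5)

# Summation form: sum of squared residuals
sse_sum <- sum((y - X %*% beta)^2)

# Matrix form: (y - X beta)^T (y - X beta)
r <- y - X %*% beta
sse_mat <- t(r) %*% r

all.equal(sse_sum, as.numeric(sse_mat))  # TRUE; both equal 1.85
```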

Visualizing Squared Residuals


Application Exercise

  1. Go to lucy.shinyapps.io/least-squares/.
  2. This shows a scatter plot of 10 data points with a line estimating the relationship between x and y. Drag the blue points to change the line.
  3. See if you can find a line that minimizes the sum of squared errors.

Geometric Interpretation of Least Squares

What We’re Really Doing

The regression equation: \[\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}\]

The goal: Find \(\boldsymbol{\beta}\) that makes \(\mathbf{X}\boldsymbol{\beta}\) as close as possible to \(\mathbf{y}\)

Understanding the Vector Space

If we have \(n\) observations, we work in \(n\)-dimensional space

\(\mathbf{y}\) is a vector with \(n\) components (one value per observation)

\(\mathbf{X}\boldsymbol{\beta}\) is also a vector with \(n\) components (one prediction per observation)

Both vectors live in the same \(n\)-dimensional space

What is Column Space?

In words: All possible predictions your model can make

Think of it as: Every combination of your predictor variables

Column Space with Numbers

Your data: \(x = [1, 2, 3]\) with 3 observations

Design matrix: \(\mathbf{X} = \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \end{bmatrix}\)

Column space contains: All vectors of the form \(\beta_0 \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} + \beta_1 \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}\)

Examples of Vectors in Column Space

Example 1: Choosing \(\beta_0=0, \beta_1=2\) gives \([2, 4, 6]\)

In words: Intercept = 0, slope = 2, so predictions are [2, 4, 6]

Example 2: Choosing \(\beta_0=5, \beta_1=0\) gives \([5, 5, 5]\)

In words: Intercept = 5, slope = 0, so all predictions equal 5
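The two examples above are easy to compute directly in R:

```r
# Columns: intercept, x
X <- cbind(1, c(1, 2, 3))

# Any choice of beta gives a vector in the column space
X %*% c(0, 2)   # intercept 0, slope 2 -> [2, 4, 6]
X %*% c(5, 0)   # intercept 5, slope 0 -> [5, 5, 5]
```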

The Fundamental Problem

Your actual observed data: \(\mathbf{y} = [2.1, 3.9, 5.8]\)

Question: Is there some \(\beta_0, \beta_1\) such that \(\mathbf{X}\boldsymbol{\beta} = \mathbf{y}\) exactly?

In other words: Is \([2.1, 3.9, 5.8]\) exactly equal to \(\beta_0[1,1,1] + \beta_1[1,2,3]\)?

Why We Usually Can’t Hit Exactly

  • Answer: Usually no! Real data has irreducible error
  • Your observed \(\mathbf{y}\) typically doesn’t lie perfectly in the column space
  • In words: Your data points don’t lie exactly on any straight line

The Geometric Solution

  • Since we can’t hit \(\mathbf{y}\) exactly, let’s get as close as possible
  • Find the point in the column space that is closest to \(\mathbf{y}\)
  • This closest point is the orthogonal projection of \(\mathbf{y}\) onto the column space

The Residual Vector

  • Definition: \(\hat\varepsilon = \mathbf{y} - \hat{\mathbf{y}}\)
  • In words: The difference between what we observed and what we predicted
  • This represents the part of \(\mathbf{y}\) we cannot explain with our model

Key Geometric Property

  • The residual vector is perpendicular to the column space
  • Mathematical notation: \(\hat\varepsilon \perp \text{Col}(\mathbf{X})\)
  • Why this matters: Perpendicularity guarantees we found the closest point

Why Perpendicular Means Closest?

  • Imagine dropping a ball onto a table from above
  • The shortest path is straight down (perpendicular to the table)
  • Any diagonal path would be longer
  • The same principle applies in higher dimensions

From Geometry to Algebra

  • Geometric fact: \(\hat\varepsilon \perp \text{Col}(\mathbf{X})\)
  • This means \(\hat\varepsilon\) is perpendicular to every vector in the column space
  • Since columns of \(\mathbf{X}\) span the column space, \(\hat\varepsilon\) is perpendicular to each column

Mathematical Expression of Perpendicularity

If \(\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_p]\), then:

\(\mathbf{x}_1^T\hat\varepsilon = 0\), \(\mathbf{x}_2^T\hat\varepsilon = 0\), …, \(\mathbf{x}_p^T\hat\varepsilon = 0\)

Stacking these equations gives us: \(\mathbf{X}^T\hat\varepsilon = \mathbf{0}\)

The Normal Equations

Starting from: \(\mathbf{X}^T\hat\varepsilon = \mathbf{0}\)

Substitute \(\hat\varepsilon = \mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}\):
\[\mathbf{X}^T(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) = \mathbf{0}\]

In words: “The residuals are perpendicular to every column of \(\mathbf{X}\)”

You Try: Matrix Algebra

Starting with: \(\mathbf{A}\mathbf{x} + \mathbf{b} = \mathbf{c}\)

Solve for \(\mathbf{x}\) step by step:

  1. First, isolate the \(\mathbf{A}\mathbf{x}\) term
  2. Then multiply both sides by \(\mathbf{A}^{-1}\) (on the left!)
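One way to check these two steps numerically in R, with illustrative matrices (`solve(A)` computes \(\mathbf{A}^{-1}\)):

```r
# Illustrative invertible matrix and vectors
A <- matrix(c(2, 1,
              1, 3), byrow = TRUE, ncol = 2)
b <- c(1, 2)
c_vec <- c(5, 10)

# Step 1: isolate Ax:            Ax = c - b
# Step 2: left-multiply by A^-1: x  = A^-1 (c - b)
x <- solve(A) %*% (c_vec - b)

# Check: Ax + b should reproduce c
all.equal(as.numeric(A %*% x + b), c_vec)  # TRUE
```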

Expanding the Normal Equations

Start with: \(\mathbf{X}^T(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) = \mathbf{0}\)

Distribute: \(\mathbf{X}^T\mathbf{y} - \mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{0}\)

Move the second term to the right side: \(\mathbf{X}^T\mathbf{y} = \mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}}\)

Solving for Beta

We have: \(\mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}^T\mathbf{y}\)

To solve for \(\hat{\boldsymbol{\beta}}\), multiply both sides by \((\mathbf{X}^T\mathbf{X})^{-1}\):

Result: \(\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\)

This is the least squares solution!
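This formula can be checked against R’s built-in `lm()` on simulated data; a sketch:

```r
# Simulated data (illustrative)
set.seed(42)
x <- 1:10
y <- 2.3 + 1.5 * x + rnorm(10)

X <- cbind(1, x)  # design matrix with intercept column

# Least squares solution via the normal equations
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y

# Should match lm()'s coefficients
all.equal(as.numeric(beta_hat), as.numeric(coef(lm(y ~ x))))  # TRUE
```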

The Hat Matrix

Definition: \(\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\)

What it does: \(\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}\)

In words: Takes your observed data and produces the closest possible predictions

Nickname: “Puts the hat on \(\mathbf{y}\)” to get \(\hat{\mathbf{y}}\)

Hat Matrix Properties

Symmetric: \(\mathbf{H}^T = \mathbf{H}\)

Idempotent: \(\mathbf{H}^2 = \mathbf{H}\)

You Try: Hat Matrix Property

Verify that: \(\mathbf{H}^2 = \mathbf{H}\)

Hint: Substitute the definition \(\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\) and multiply it out

Remember: \((\mathbf{X}^T\mathbf{X})^{-1}(\mathbf{X}^T\mathbf{X}) = \mathbf{I}\)

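For reference, the multiplication collapses because the middle factors form an identity:

\[\mathbf{H}^2 = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\underbrace{\mathbf{X}^T\,\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}}_{=\,\mathbf{I}}\mathbf{X}^T = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T = \mathbf{H}\]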

Why “Least Squares”?

We minimize: \(\sum_{i=1}^n (y_i - \hat{y}_i)^2\)

In words: Sum of squared differences between observed and predicted

This equals: \(||\mathbf{y} - \mathbf{X}\boldsymbol{\beta}||^2\)

In words: Squared distance between the observed vector and prediction vector

Key Geometric Insights

1. Regression is projection: Finding the closest point in the column space to \(\mathbf{y}\)

2. Orthogonality is key: Residuals perpendicular to column space guarantees minimum distance

3. Hat matrix is the projection operator: \(\mathbf{H}\) projects onto column space

You Try

Try this in R:

x <- 1:10; y <- 2 * x + rnorm(10)
X <- cbind(1, x)                        # Design matrix
H <- X %*% solve(t(X) %*% X) %*% t(X)   # Hat matrix
all.equal(H %*% H, H)                   # Verify idempotent
e <- y - H %*% y                        # Residuals
t(X) %*% e                              # Should be approximately zero


The Big Picture

Least squares isn’t just algebra; it’s geometry

We’re finding the best approximation to our data within our model’s constraints

The mathematical formulas follow naturally from geometric principles